Effective Data Cleaning with Continuous Evaluation
Abstract
Enterprises have been acquiring large amounts of data from a variety of sources to build their own “Data Lakes”, with the goal of enriching their data assets and enabling richer and more informed analytics. The pace of acquisition and the variety of data sources make it impossible to clean this data as it arrives. This new reality has made data cleaning a continuous process and a part of day-to-day data processing activities. The large body of data cleaning algorithms and techniques is strong evidence of how complex the problem is, yet these techniques have seen little adoption in real-world data cleaning applications. In this article we examine how the community has been evaluating the effectiveness of data cleaning algorithms, and whether current data cleaning proposals are solving the right problems to enable the development of deployable and effective solutions.
Similar resources
Data quality analysis and cleaning strategy for wireless sensor networks
The quality of data in wireless sensor networks has a significant impact on decision support, and data cleaning is an effective way to improve data quality. However, if the data cleaning strategies are not correctly designed, it might result in an unsatisfactory cleaning effect with increased system cleaning costs. Initially, data quality evaluation indicators and their measurement methods in w...
Dissemination of Models over Time-Varying Data
Dissemination of time-varying data is essential in many applications, such as sensor networks, patient monitoring, stock tickers, etc. Often, the raw data have to go through some form of pre-processing, such as cleaning, smoothing, etc, before being disseminated. Such pre-processing often applies mathematical or statistical models to transform the large volumes of raw, point-based data into a m...
CUNI at the ShARe/CLEF eHealth Evaluation Lab 2014
This report describes the participation of the team of Charles University in Prague at the ShARe/CLEF eHealth Evaluation Lab in 2014. We took part in Task 3 (User-Centered Health Information Retrieval) and both of its subtasks (monolingual and multilingual retrieval). Our system was based on the Terrier platform and its implementation of the Hiemstra retrieval model. We experimented with several m...
Evaluation of conscript's opinion about Continuous Medical Teaching through 5th educational program of country's medical science.
Introduction. Continuous Medical Education is now regarded as a necessity for medical graduates and has been important in our country (Iran) for about a decade. Methods. Recognition of opinions in various occupational actions of post graduated persons is very important in attention for reevaluation the program of continuous teaching in post graduated...
Evaluation of the type A uncertainty in measurements with autocorrelated observations
Described is the proposal of application of the GUM uncertainty type A evaluation to measurements with auto-correlated observations. The first step to it is the identification and cleaning of the raw sample data from the regularly variable components. Then formulas for standard deviation of the sample and of the mean value are expressed with the use of correction coefficients or the so-called "eff...
Journal: IEEE Data Eng. Bull.
Volume: 39
Publication date: 2016